ABSTRACT

Closeness is an important centrality measure widely used in the analysis of real-world complex networks. In particular, the problem of selecting the k most central nodes with respect to this measure has been deeply analyzed in the last decade. However, even for networks that are not very large, this problem is computationally intractable in practice: indeed, Abboud et al. have recently shown that its complexity is strictly related to the complexity of the All-Pairs Shortest Path (in short, APSP) problem, for which no subcubic “combinatorial” algorithm is known. In this paper, we propose a new algorithm for selecting the k most closeness central nodes in a graph. In practice, this algorithm significantly improves over the APSP approach, even though its worst-case time complexity is the same. For example, the algorithm is able to compute the top k nodes in a few dozen seconds even when applied to real-world networks with millions of nodes and edges. We will also show experimentally that our algorithm drastically outperforms the most recently designed algorithm, proposed by Olsen et al. Finally, we apply the new algorithm to the computation of the most central actors in the IMDB collaboration network, where two actors are linked if they played together in a movie.

Categories and Subject Descriptors 

G.2.2 [Discrete Mathematics]: Graph Theory - graph algorithms; H.2.8 [Database Management]: Database Applications - data mining; I.1.2 [Computing Methodologies]: Algorithms - analysis of algorithms

General Terms 

Design, Algorithms, Performance 

1. INTRODUCTION 

The problem of identifying the most central nodes in a network is a fundamental question that has been asked many times in a plethora of research areas, such as biology, computer science, sociology, and psychology. Because of the importance of this question, several centrality measures have been introduced in the literature (for a recent survey, see [3]). Closeness is certainly one of the oldest and most widely used among these measures [2]. Informally, the closeness of a node v is the inverse of the expected distance between v and a random node w, and it estimates the efficiency of a node in spreading information to other nodes: the larger the closeness, the more “influential” the node is.

Formally, given a (directed strongly) connected graph $G = (V, E)$, the closeness of a vertex v is defined as $c(v) = \frac{n-1}{f(v)}$, where $n = |V|$, $f(v) = \sum_{w \in V} d(v,w)$ is the farness of v, and $d(v,w)$ is the distance between the two vertices v and w (that is, the number of edges in a shortest path from v to w). If G is not (strongly) connected, the definition is more complicated, because $d(v,w)$ is defined only for vertices reachable from v. Let $R(v)$ be the set of these vertices, and let $r(v)$ denote its cardinality (note that $v \in R(v)$ by definition). In the literature [15, 21, 17], the most common generalization is

\[
c(v) = \frac{r(v)-1}{f(v)} \cdot \frac{r(v)-1}{n-1} = \frac{(r(v)-1)^2}{(n-1)\,f(v)},
\]

where $f(v) = \sum_{w \in R(v)} d(v,w)$.
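The generalized definition above can be illustrated with a single BFS. The following is a minimal sketch, assuming the graph is stored as a dictionary mapping each node to the list of its out-neighbors; the function name `closeness` and the convention of returning 0 when no other node is reachable are assumptions made for this example.

```python
from collections import deque

def closeness(adj, v):
    """Generalized closeness of v; adj maps each node to its out-neighbors."""
    n = len(adj)
    dist = {v: 0}                      # d(v, w) for every reachable w
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    r = len(dist)                      # r(v) = |R(v)|, v included
    f = sum(dist.values())             # farness f(v) over R(v)
    if r == 1:                         # assumed convention: nothing reachable
        return 0.0
    return (r - 1) ** 2 / ((n - 1) * f)
```

On a strongly connected graph, r(v) = n and the value reduces to (n - 1) / f(v), in agreement with the first definition.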

In order to compute the k vertices with largest closeness, the textbook algorithm computes c(v) for each v and returns the k largest values found. The main bottleneck of this approach is the computation of d(v,w) for each pair of vertices v and w (that is, solving the All-Pairs Shortest Paths or APSP problem). This can be done in two ways: either by using fast matrix multiplication, in time $O(n^{2.373})$ [22], or by performing a breadth-first search (in short, BFS) from each vertex $v \in V$, in time $O(mn)$, where $m = |E|$. Usually, the BFS approach is preferred because the other approach hides big constants in the O notation, and because real-world networks are usually sparse, that is, m is not much bigger than n. However, this approach is also too time-consuming if the input graph is very big (with millions of nodes and hundreds of millions of edges). Our algorithm heavily improves the BFS approach by “cutting the BFSes” through an efficient pruning procedure, which allows us either to compute the closeness of a node v or to stop the corresponding BFS as soon as we are sure that the closeness of v is not among the k largest ones. The worst-case




Table 1: notations used throughout the paper.

| Name | Symbol | Definition |
|---|---|---|
| Graphs | $G = (V,E)$ | Graph with node/vertex set V and edge/arc set E |
| | $n$ | $\lvert V \rvert$ |
| | $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ | Weighted DAG of strongly connected components (see Section 1.3) |
| Degree functions | $\deg(v)$ | Degree of a node in an undirected graph |
| | $\mathrm{outdeg}(v)$ | Out-degree of a node in a directed graph |
| Distance function | $d(v,w)$ | Number of edges in a shortest path from v to w |
| Reachability set function | $R(v)$ | Set of nodes reachable from v (by definition, $v \in R(v)$) |
| | $r(v)$ | $\lvert R(v) \rvert$ |
| | $\alpha(v)$ | Lower bound on r(v), that is, $\alpha(v) \leq r(v)$ (see Section 2.4) |
| | $\omega(v)$ | Upper bound on r(v), that is, $r(v) \leq \omega(v)$ (see Section 2.4) |
| Neighborhood functions | $\Gamma_d(v)$ | Set of nodes at distance d from v, that is, $\{w \in V : d(v,w) = d\}$ |
| | $\gamma_d(v)$ | Number of nodes at distance d from v, that is, $\lvert \Gamma_d(v) \rvert$ |
| | $\tilde{\gamma}_{d+1}(v)$ | Upper bound on $\gamma_{d+1}(v)$, defined as $\sum_{u \in \Gamma_d(v)} \deg(u)$ if the graph is undirected, $\sum_{u \in \Gamma_d(v)} \mathrm{outdeg}(u)$ otherwise (clearly, $\gamma_{d+1}(v) \leq \tilde{\gamma}_{d+1}(v)$) |
| Ball functions | $N_d(v)$ | Set of nodes at distance at most d from v, that is, $\{w \in V : d(v,w) \leq d\}$ |
| | $n_d(v)$ | Number of nodes at distance at most d from v, that is, $\lvert N_d(v) \rvert$ |
| Farness functions | $f(v)$ | Farness of node v, that is, $\sum_{w \in R(v)} d(v,w)$ |
| | $f_d(v)$ | Farness of node v up to distance d, that is, $\sum_{w \in N_d(v)} d(v,w)$ |
| | $\tilde{f}_d(v,x)$ | Lower bound function on farness of node v, that is, $f_d(v) - \tilde{\gamma}_{d+1}(v) + (d+2)(x - n_d(v))$ (see Lemma 1) |
| Closeness functions | $c(v)$ | Closeness of node v, that is, $\frac{(r(v)-1)^2}{(n-1) f(v)}$ |
| | $\tilde{c}_d(v)$ | Upper bound on closeness of node v, that is, $\frac{(r(v)-1)^2}{(n-1)\tilde{f}_d(v,r(v))}$ (see Corollary 1) |

time complexity of this algorithm is the same as that of the textbook algorithm: indeed, under reasonable complexity assumptions, this worst-case bound cannot be improved. In particular, it was proved in [1] that the complexity of finding the most central vertex is at least the complexity of the APSP: faster combinatorial algorithms for this latter problem would imply unexpected breakthroughs in complexity theory [23]. However, in practice, our algorithm heavily improves the APSP approach, as shown by our experiments. Finally, we will apply our algorithm to an interesting and prototypical case study, that is, the IMDB actor collaboration network: in a little more than half an hour, we have been able to identify the evolution of the set of the 10 most central actors, by analysing snapshots of this network taken every 5 years, starting from 1940 and up to 2014.
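For reference, the textbook approach discussed above (one full BFS per vertex, then selection of the k largest values) can be sketched as follows. This is only an illustrative baseline, not the new algorithm; the adjacency-dictionary representation and the names `bfs_distances` and `textbook_top_k` are assumptions of this example.

```python
import heapq
from collections import deque

def bfs_distances(adj, source):
    """Distances from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def textbook_top_k(adj, k):
    """Run one complete BFS per vertex: O(mn) time overall."""
    n = len(adj)
    closeness = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        r, f = len(dist), sum(dist.values())
        closeness[v] = (r - 1) ** 2 / ((n - 1) * f) if r > 1 else 0.0
    # Select the k (vertex, closeness) pairs with largest closeness.
    return heapq.nlargest(k, closeness.items(), key=lambda item: item[1])
```

Every BFS here runs to completion, which is exactly the cost that the pruning procedure is meant to avoid.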

1.1 Related Work 

Closeness is a “traditional” definition of centrality, and consequently it was not “designed with scalability in mind”, as stated in [12]. Also in [5], it is said that closeness centrality can “identify influential nodes”, but it is “incapable to be applied in large-scale networks due to the computational complexity”. The simplest solution considered was to define different measures that might be related to closeness centrality [12].

A different line of research has tried to develop more efficient algorithms for this problem. As already said, due to theoretical bounds [4, 1], the worst-case complexity cannot be improved in general, unless widely believed conjectures prove to be false. Nevertheless, it is possible to design approximation algorithms: the simplest approach samples the distance between a node v and l other nodes w, and returns the average of all values d(v,w) found [10]. The time complexity is $O(lm)$ to approximate the centrality of all nodes. More refined approximation algorithms are provided in [7, 6], based on the concept of All-Distance Sketch, that is, a procedure that computes the distances between v and $O(\log n)$ other nodes, chosen in order to provide good estimates of the closeness centrality of v. Even if these approximation algorithms work quite well, they are not suited to the ranking of nodes, because the difference between the closeness centrality of two nodes might be very small. Nevertheless, they were used in [16], where the sampling technique developed in [10] was used to actually compute the top k vertices: the result is not guaranteed to be exact, but it is exact with high probability. The
authors proved that the time complexity of their algorithm is $O(mn^{2/3} \log n)$, under the rather strong assumption that closeness centralities are uniformly distributed between 0 and D, where D is the maximum distance between two nodes (in the worst case, the time complexity of this algorithm is $O(mn)$).
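The simplest sampling approach mentioned above can be sketched as follows, assuming a connected undirected graph (so that distances are symmetric and one BFS from each sampled node serves all vertices at once). The function names, the `seed` parameter, and the choice of returning the estimated average distance rather than its inverse are assumptions of this sketch, not details taken from [10].

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Distances from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def sampled_average_distance(adj, num_samples, seed=0):
    """Estimate, for every node, its average distance to num_samples random nodes."""
    rng = random.Random(seed)
    nodes = list(adj)
    pivots = rng.sample(nodes, num_samples)   # sampled nodes, without repetition
    total = dict.fromkeys(nodes, 0)
    for p in pivots:                          # one BFS per sample: O(lm) in total
        dist = bfs_distances(adj, p)
        for v in nodes:
            total[v] += dist[v]               # d(p, v) = d(v, p) in undirected graphs
    return {v: total[v] / num_samples for v in nodes}
```

A smaller estimated average distance suggests a larger closeness; since two nodes may have very similar values, such estimates alone do not give a reliable ranking.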

Other approaches have tried to develop incremental algorithms, which might be more suited to real-world network analyses. For instance, in [14], the authors develop heuristics to determine the k most central vertices in a varying environment. A different work addressed the problem of updating centralities after edge insertion or deletion [18]: for instance, it is shown that it is possible to update the closeness centrality of 1.2 million authors in the DBLP-coauthorship network 460 times faster than recomputing it from scratch.

Finally, some works have tried to exploit properties of real-world networks in order to find more efficient algorithms. In [13], the authors develop a heuristic to compute the k most central vertices according to different measures. The basic idea is to identify central nodes according to a simple centrality measure (for instance, degree of nodes), and then to inspect a small set of central nodes according to this measure, hoping it will contain the top k vertices according to the “complex” measure. Another approach [17] tried to exploit the properties of real-world networks in order to develop exact algorithms with worst-case complexity $O(mn)$, but performing much better in practice. As far as we know, this is the only exact algorithm that is able to efficiently compute the k most central vertices in networks with up to 1 million nodes.

However, despite this huge amount of research, the major graph libraries still implement the textbook algorithm: among them, Boost Graph Library [11], Sagemath [9], igraph [20], and NetworkX [19]. This is due to the fact that the only efficient algorithm published until now for top k closeness centrality is [17], which was published only 1 year ago, and it is quite complicated, because it is based on several other algorithms.

1.2 Our Results 

As we said before, in this paper we present a new and very simple algorithm for the exact computation of top k closeness central vertices. We show that we drastically improve both the probabilistic approach [16], and the best algorithm available until now [17]. We have computed for the first time the 10 most central nodes in networks with millions of nodes and hundreds of millions of edges, in very little time. A significant example is the IMDB actor network (1 797 446 nodes and 72 880 156 edges): we have computed the 10 most central actors in less than 10 minutes. Moreover, in our DBLP co-authorship network (which should be quite similar to the network used in [18]), our performance is more than 6 000 times better than the performance of the textbook algorithm: if only the most central node is needed, we can recompute it from scratch more than 10 times faster than performing their update. Finally, our approach is not only very efficient, but it is also very easy to code, making it a very good candidate to be implemented in existing graph libraries.

1.3 Preliminary Definitions 

We assume the reader to be familiar with the basic notions of graph theory (see, for example, [8]): all the notations and definitions used throughout this paper are summarised in Table 1. In addition to these definitions, let us precisely define the weighted directed acyclic graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ of strongly connected components (in short, SCCs) corresponding to a directed graph $G = (V,E)$. In this graph, $\mathcal{V}$ is the set of SCCs of G, and, for any two SCCs $C, D \in \mathcal{V}$, $(C, D) \in \mathcal{E}$ if and only if there is an arc in E from a node in C to a node in D. For each SCC $C \in \mathcal{V}$, the weight $w(C)$ of C is equal to $|C|$, that is, the number of nodes in the SCC C. Note that, for each node $v \in C$, $r(v) = \sum_{D \in R(C)} w(D)$, where $R(C)$ denotes the set of SCCs that are reachable from C in $\mathcal{G}$, and $r(C)$ denotes its cardinality.
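As an illustration of this definition, the weighted DAG of SCCs can be built with any linear-time SCC algorithm. The sketch below uses an iterative version of Kosaraju's algorithm, which is a choice made for this example (the paper does not prescribe one); the names `scc_dag` and `reachable_weight` and the adjacency-dictionary representation are likewise assumptions.

```python
def scc_dag(adj):
    """Return (comp, weight, dag): SCC index of each node, |C| per SCC, arcs between SCCs."""
    order, seen = [], set()
    for s in adj:                               # first pass: DFS finishing order on G
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(adj[s]))]
        while stack:
            u, it = stack[-1]
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(adj[w])))
                    break
            else:                               # all out-neighbors explored
                order.append(u)
                stack.pop()
    radj = {v: [] for v in adj}                 # reversed graph
    for u in adj:
        for w in adj[u]:
            radj[w].append(u)
    comp, weight = {}, []
    for s in reversed(order):                   # second pass on the reversed graph
        if s in comp:
            continue
        c = len(weight)
        comp[s] = c
        size, stack = 0, [s]
        while stack:
            u = stack.pop()
            size += 1
            for w in radj[u]:
                if w not in comp:
                    comp[w] = c
                    stack.append(w)
        weight.append(size)                     # w(C) = |C|
    dag = [set() for _ in weight]
    for u in adj:
        for w in adj[u]:
            if comp[u] != comp[w]:
                dag[comp[u]].add(comp[w])
    return comp, weight, dag

def reachable_weight(c, weight, dag):
    """Sum of w(D) over all SCCs D reachable from SCC c (c included), i.e. r(v) for v in c."""
    seen, stack = {c}, [c]
    while stack:
        x = stack.pop()
        for y in dag[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return sum(weight[x] for x in seen)
```

Computing `reachable_weight` for every SCC in this way is not linear in general, which is why only lower and upper bounds on r(v) are used when the graph is not strongly connected.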

1.4 Structure of the Paper 


In Section 2, we will explain how our algorithm works, and we will prove its correctness. Section 3 will show experimentally that our algorithm outperforms the best available algorithms, by performing several tests on a dataset of real-world networks. Section 4 applies the new algorithm to the analysis of the 10 most central actors in the IMDB actor network.

2. THE ALGORITHM 
2.1 Overview 

As we already said, the textbook algorithm for computing the k vertices with largest closeness performs a BFS from each vertex v, computes its closeness c(v), and, finally, returns the k vertices with biggest c(v) values. Similarly, our algorithm (see Algorithm 1) sets $c(v) = \mathrm{BFSCut}(v, x_k)$, where $x_k$ is the k-th biggest closeness value found until now ($x_k = 0$ if we have not processed at least k vertices). If $\mathrm{BFSCut}(v, x_k) = 0$, it means that v is not one of the k most central vertices, otherwise c(v) is the actual closeness of v. This means that, at the end, the k vertices with biggest closeness values are again the k most central vertices. In order to speed up the function $\mathrm{BFSCut}(v, x_k)$, we want $x_k$ to be as big as possible, and consequently we need to process central vertices as soon as possible. To this purpose, following the idea of [14], we process vertices in decreasing order of degree.


Algorithm 1: overview of the new algorithm. The function Kth(c) returns the k-th biggest element of c and the function TopK(c) returns the k biggest elements.

    Preprocessing(G);    // see Section 2.4
    c(v) ← 0 for each v;
    x_k ← 0;
    for v ∈ V in decreasing order of degree do
        c(v) ← BFSCut(v, x_k);
        if c(v) ≠ 0 then
            x_k ← Kth(c);
    return TopK(c);


As we will see in the next sections, Algorithm 1 needs linear time in order to execute the preprocessing. Moreover, it requires time $O(n \log n)$ to sort vertices, $O(\log k)$ to perform function Kth, and $O(1)$ to perform function TopK, by using a priority queue containing at each step the k most central vertices. Since all other operations need time $O(1)$, the total running time is $O(m + n \log n + n \log k + T) = O(m + n \log n + T)$, where T is the time needed to perform the function BFSCut n times.
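The priority queue mentioned above can be realised, for instance, with a binary min-heap of size at most k, whose root is the current k-th biggest value. The class below is a hypothetical helper written for this illustration (the name `TopK` and its methods are not taken from the paper's code); with it, each insertion costs O(log k) and reading the threshold costs O(1).

```python
import heapq

class TopK:
    """Keep the k largest (closeness, vertex) pairs seen so far."""

    def __init__(self, k):
        self.k = k
        self.heap = []                       # min-heap: heap[0] holds the k-th biggest value

    def push(self, value, vertex):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (value, vertex))
        elif value > self.heap[0][0]:        # better than the current k-th value
            heapq.heapreplace(self.heap, (value, vertex))

    def kth(self):
        """Current threshold x_k (0 until k values have been stored)."""
        return self.heap[0][0] if len(self.heap) == self.k else 0.0

    def items(self):
        """The stored pairs, from most to least central."""
        return sorted(self.heap, reverse=True)
```

In the main loop, `push` would be called only when BFSCut returns a non-zero value, and `kth()` would supply the threshold for the next call.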

Before explaining in detail how the function BFSCut operates, let us note that we can easily parallelise the for loop in Algorithm 1, by giving each vertex to a different thread. The parallelisation is almost complete, because the only “non-parallel” part deals with the shared variable $x_k$, which must be updated in a synchronised way. In any case, this shared variable does not affect performance. First of all, the time needed for the update is very small, compared to the time of the BFSes. Moreover, in principle, a thread might use an “old” value of $x_k$, and consequently lose performance: we will experimentally show that the effect of this benign race condition is negligible (Section 3.3).



2.2 An Upper Bound on the Closeness of Nodes

The goal of this section is to define an upper bound $\tilde{c}_d(v)$ on the closeness of a node v, which has to be updated whenever, for any $d \geq 0$, all nodes in $\Gamma_d(v)$ have been reached by the BFS starting from v (that is, whenever the exploration of the d-th level of the BFS tree is finished). More precisely, this upper bound is obtained by first proving a lower bound on the farness of v, as shown in the following lemma.


Lemma 1. For each vertex v and for each $d \geq 0$,

\[
f(v) \geq \tilde{f}_d(v, r(v))
\]

(see Table 1 for the definition of $\tilde{f}_d(v, r(v))$).

Proof. From the definitions of Table 1, it follows that

\[
f(v) \geq f_d(v) + (d+1)\,\gamma_{d+1}(v) + (d+2)\,(r(v) - n_{d+1}(v)).
\]

Since $n_{d+1}(v) = \gamma_{d+1}(v) + n_d(v)$,

\[
f(v) \geq f_d(v) - \gamma_{d+1}(v) + (d+2)\,(r(v) - n_d(v)).
\]

Finally, since $\tilde{\gamma}_{d+1}(v)$ is an upper bound on $\gamma_{d+1}(v)$,

\[
f(v) \geq f_d(v) - \tilde{\gamma}_{d+1}(v) + (d+2)\,(r(v) - n_d(v)) = \tilde{f}_d(v, r(v))
\]

and the lemma follows. □
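The bound of Lemma 1 can be checked numerically on small graphs. The sketch below computes the farness of v together with the lower bounds obtained after each BFS level d, following the definitions of Table 1 for directed graphs; the function name, the adjacency-dictionary representation, and the assumption that adjacency lists contain no duplicate arcs are choices made for this example.

```python
from collections import deque

def farness_lower_bounds(adj, v):
    """Return f(v) and the list of lower bounds on f(v), one per BFS level d = 0, 1, ..."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    r, f = len(dist), sum(dist.values())        # r(v) and f(v)
    bounds = []
    f_d = n_d = 0
    for d in range(max(dist.values()) + 1):
        level = [u for u in dist if dist[u] == d]            # nodes at distance exactly d
        f_d += d * len(level)                                # farness up to distance d
        n_d += len(level)                                    # nodes within distance d
        gamma_up = sum(len(adj[u]) for u in level)           # out-degree sum of level d
        bounds.append(f_d - gamma_up + (d + 2) * (r - n_d))  # bound of Lemma 1
    return f, bounds
```

For the directed cycle on four nodes, the bounds are 5, 6, 6 and 5 for d = 0, 1, 2, 3, all at most f(v) = 6.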


Lemma 2. For each vertex v and for each $d \geq 0$,

\[
\frac{1}{c(v)} \geq \lambda_d(v) := (n-1)\,\min\left( \frac{\tilde{f}_d(v,\alpha(v))}{(\alpha(v)-1)^2},\ \frac{\tilde{f}_d(v,\omega(v))}{(\omega(v)-1)^2} \right).
\]

Proof. From Lemma 1, if we denote $a = d + 2$ and $b = \tilde{\gamma}_{d+1}(v) + a(n_d(v) - 1) - f_d(v)$,

\[
\begin{aligned}
f(v) &\geq f_d(v) - \tilde{\gamma}_{d+1}(v) + a\,(r(v) - n_d(v)) \\
     &= a\,(r(v) - 1) + f_d(v) - \tilde{\gamma}_{d+1}(v) - a\,(n_d(v) - 1) \\
     &= a\,(r(v) - 1) - b.
\end{aligned}
\]

Note that $a > 0$ because $d \geq 0$, and $b > 0$ because

\[
f_d(v) = \sum_{w \in N_d(v)} d(v,w) \leq d\,(n_d(v) - 1) < a\,(n_d(v) - 1),
\]

where the first inequality holds because, if $w = v$, then $d(v,w) = 0$, and if $w \in N_d(v)$, then $d(v,w) \leq d$. Hence,

\[
\frac{1}{c(v)} \geq (n-1)\,\frac{a\,(r(v)-1) - b}{(r(v)-1)^2}.
\]

Let us consider the function $g(x) = \frac{ax - b}{x^2}$. The derivative $g'(x) = \frac{-ax + 2b}{x^3}$ is positive for $0 < x < \frac{2b}{a}$ and negative for $x > \frac{2b}{a}$: this means that $\frac{2b}{a}$ is a local maximum, and there are no local minima for $x > 0$. Consequently, in each closed interval $[x_1, x_2]$ where $x_1$ and $x_2$ are positive, the minimum of $g(x)$ is reached in $x_1$ or $x_2$. Since $0 < \alpha(v) - 1 \leq r(v) - 1 \leq \omega(v) - 1$,

\[
g(r(v) - 1) \geq \min\bigl(g(\alpha(v) - 1),\ g(\omega(v) - 1)\bigr)
\]

and the conclusion follows. □
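The last step of the proof relies on an identity that is left implicit: with a and b as defined in the proof, the function g evaluated at x - 1 is exactly the ratio that appears in the statement of the lemma. Spelled out (using only the definition of the farness lower bound function from Table 1):

```latex
\[
\tilde{f}_d(v,x) = f_d(v) - \tilde{\gamma}_{d+1}(v) + a\,(x - n_d(v)) = a\,(x-1) - b
\quad\Longrightarrow\quad
g(x-1) = \frac{a\,(x-1) - b}{(x-1)^2} = \frac{\tilde{f}_d(v,x)}{(x-1)^2}.
\]
\[
\frac{1}{c(v)} \;\geq\; (n-1)\,g(r(v)-1)
\;\geq\; (n-1)\,\min\bigl(g(\alpha(v)-1),\ g(\omega(v)-1)\bigr) \;=\; \lambda_d(v).
\]
```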


Corollary 1. For each vertex v and for each $d \geq 0$, $c(v) \leq \tilde{c}_d(v)$, where $\tilde{c}_d(v)$ is defined in Table 1.

2.3 Computing the Upper Bound 

Apart from r(v), all quantities necessary to compute $\tilde{f}_d(v, r(v))$ (and, hence, the upper bound on the closeness that follows from Lemma 1) are available as soon as all vertices in $N_d(v)$ are visited by a BFS. Note that, if the graph G is (strongly) connected, then $r(v) = n$ is also available. Moreover, if the graph G is undirected (but not necessarily connected), we can compute r(v) for each vertex v in linear time, at the beginning of the algorithm (by simply computing the connected components of G).¹ It thus remains to deal with the case in which G is directed and not strongly connected.

In this case, let us assume, for now, that we know a lower (respectively, upper) bound $\alpha(v)$ (respectively, $\omega(v)$) on r(v) (see also Table 1): without loss of generality we can assume that $\alpha(v) > 1$. The next lemma shows that, instead of examining all possible values of r(v) between $\alpha(v)$ and $\omega(v)$, it is sufficient to examine only the two extremes of this interval. In order to apply this idea, however, we have to work with the inverse of the closeness of v, because otherwise the denominator might vanish or even be negative (due to the fact that we are both upper bounding $\gamma_{d+1}(v)$ and lower bounding r(v)). In other words, the lemma will provide us a lower bound $\lambda_d(v)$ on $\frac{1}{c(v)}$ (so that, if $\lambda_d(v)$ is negative, then $\lambda_d(v) \leq \frac{1}{c(v)}$ trivially holds).

¹ Note that if G is undirected, Lemma 1 still holds if we redefine, for any $d \geq 1$, $\tilde{\gamma}_{d+1}(v) = \sum_{u \in \Gamma_d(v)} (\deg(u) - 1)$, because at least one edge from each vertex u in $\Gamma_d(v)$ connects u to a node in $\Gamma_{d-1}(v)$.


2.4 Computing α(v) and ω(v)

It now remains to compute $\alpha(v)$ and $\omega(v)$ (in the case of a directed graph which is not strongly connected). This can be done during the preprocessing phase of our algorithm as follows. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be the weighted directed acyclic graph of SCCs, as defined in Section 1.3. We already observed that, if v and w are in the same SCC C, then $r(v) = r(w) = \sum_{D \in R(C)} w(D)$, where $R(C)$ denotes the set of SCCs that are reachable from C in $\mathcal{G}$. This means that we simply need to compute a lower (respectively, upper) bound $\alpha(C)$ (respectively, $\omega(C)$) on $\sum_{D \in R(C)} w(D)$, for every SCC C. To this aim, we first compute a topological sort $\{C_1, \ldots, C_l\}$ of $\mathcal{V}$ (that is, if $(C_i, C_j) \in \mathcal{E}$, then $i < j$). Then, we use a dynamic programming approach: starting from $C_l$, we process the SCCs in reverse topological order, and we set

\[
\alpha(C) = w(C) + \max_{(C,D) \in \mathcal{E}} \alpha(D)
\]

and

\[
\omega(C) = w(C) + \sum_{(C,D) \in \mathcal{E}} \omega(D).
\]

Note that processing the SCCs in reverse topological order ensures that the values $\alpha(D)$ and $\omega(D)$ on the right-hand side of these equalities are available when we process the SCC C. Clearly, the complexity of computing $\alpha(C)$ and $\omega(C)$, for each SCC C, is linear in the size of $\mathcal{G}$, which in turn is smaller than G.
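The dynamic programming step can be sketched in a few lines, assuming the DAG of SCCs is already available with the SCCs indexed in topological order (index i precedes index j whenever there is an arc from the i-th to the j-th SCC). The function name and the list-based representation are assumptions of this example; for an SCC without outgoing arcs, the maximum and the sum are taken to be 0.

```python
def reachability_bounds(weight, dag):
    """weight[c] = w(C); dag[c] = successors of SCC c; indices follow a topological order."""
    num_sccs = len(weight)
    alpha = [0] * num_sccs
    omega = [0] * num_sccs
    for c in range(num_sccs - 1, -1, -1):      # reverse topological order
        # Lower bound: the SCC itself plus the best lower bound among its successors.
        alpha[c] = weight[c] + max((alpha[d] for d in dag[c]), default=0)
        # Upper bound: the SCC itself plus all successor bounds (may count an SCC twice).
        omega[c] = weight[c] + sum(omega[d] for d in dag[c])
    return alpha, omega
```

On a diamond-shaped DAG (0 → 1, 0 → 2, 1 → 3, 2 → 3) with weights 1, 2, 3, 4, the source SCC reaches nodes of total weight 10, while the bounds are 8 and 14: the upper bound counts the sink twice, and the lower bound follows only one branch.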

Observe that the bounds obtained through this simple approach can be improved by using some “tricks”. First of all, when the biggest SCC C is processed, we do not use the dynamic programming approach and we can exactly compute $\sum_{D \in R(C)} w(D)$ by simply performing a BFS starting from any node in C. This way, not only are $\alpha(C)$ and $\omega(C)$ exact, but the values $\alpha(D)$ and $\omega(D)$ are also improved for each SCC D from which it is possible to reach C. Finally, in order to compute the upper bounds for the SCCs that are able to reach C, we can run the dynamic programming algorithm on the graph obtained from $\mathcal{G}$ by removing all components reachable from C, and we can then add $\sum_{D \in R(C)} w(D)$.

2.5 The BFSCut Function 

We are now ready to define the function BFSCut(v, x), which returns the closeness c(v) of vertex v if $c(v) > x$, and 0 otherwise. To do so, this function performs a BFS starting from v, and during the BFS it updates the upper bound $\tilde{c}_d(v) \geq c(v)$ (the update is done whenever all nodes in $\Gamma_d(v)$ have been reached): as soon as $\tilde{c}_d(v) \leq x$, we know that $c(v) \leq \tilde{c}_d(v) \leq x$, and we return 0. If this situation never occurs, at the end of the visit we have clearly computed c(v).

Algorithm 2 is the pseudo-code of the function BFSCut when implemented for strongly connected graphs (recall that, in this case, $r(v) = n$): this code can be easily adapted to the case of undirected graphs (see the beginning of Section 2.3) and to the case of directed (not necessarily strongly connected) graphs (see Lemma 2).
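To make the interplay between BFSCut and the main loop of Algorithm 1 concrete, here is a minimal Python sketch for strongly connected directed graphs (or connected undirected graphs stored with symmetric adjacency lists). It is not the authors' Java implementation. Three choices are assumptions of this sketch: the adjacency-dictionary representation, restarting the out-degree sum at every level so that it matches the definition of the upper bound on the next level size in Table 1, and skipping the cut whenever the farness lower bound is not positive (in which case it carries no information).

```python
import heapq
from collections import deque

def bfs_cut(adj, v, x):
    """Return c(v) if it exceeds x, and 0.0 otherwise (graph assumed strongly connected)."""
    n = len(adj)
    dist = {v: 0}
    queue = deque([v])
    d = f = gamma = nd = 0
    while queue:
        u = queue.popleft()
        if dist[u] > d:                          # level d has been completely explored
            f_low = f - gamma + (d + 2) * (n - nd)   # farness lower bound (Lemma 1, r(v) = n)
            if f_low > 0 and (n - 1) / f_low <= x:
                return 0.0                       # v cannot beat the threshold x
            d += 1
            gamma = 0                            # restart the out-degree sum for the new level
        f += dist[u]
        gamma += len(adj[u])
        nd += 1
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return (n - 1) / f

def top_k_closeness(adj, k):
    """Main loop of Algorithm 1: vertices in decreasing order of degree, threshold x_k."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    heap = []                                    # min-heap with the k best (closeness, vertex)
    x_k = 0.0
    for v in order:
        c = bfs_cut(adj, v, x_k)
        if c != 0.0:
            if len(heap) < k:
                heapq.heappush(heap, (c, v))
            else:
                heapq.heapreplace(heap, (c, v))  # c > x_k, so it replaces the k-th value
            if len(heap) == k:
                x_k = heap[0][0]
    return sorted(heap, reverse=True)
```

On an undirected path with five nodes and k = 1, only the first two vertices examined require a complete BFS; the remaining three are discarded after exploring a single level.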


Xeon(R) CPU E5-4607 0, 2.20GHz, with 48 cores, 250GB RAM, running Ubuntu 14.04.2 LTS; our code has been written in Java 1.7, and it is available at tinyurl.com/kelujv7.

3.1 Comparison with the State of the Art 

In order to compare the performance of our algorithm with state-of-the-art approaches, we have selected 21 networks whose number of nodes ranges between 1 000 and 30 000.

We have compared our algorithm Bcm with our implementations of the best available algorithms for top-k closeness centrality.² The first one [17] is based on a pruning technique and on A-BFS, a method to reuse information collected during a BFS from a node to speed up a BFS from one of its in-neighbors; we will denote this algorithm as Olh. The second one provides top-k closeness centralities with high probability [16]. It is based on performing BFSes from a random sample of nodes to estimate the closeness centrality of all the other nodes, then computing the exact centrality of all the nodes whose estimate is big enough. Note that this algorithm requires the input graph to be (strongly) connected: for this reason, differently from the other algorithms, we have run this algorithm on the largest (strongly) connected component. We will denote this latter algorithm as Ocl.


Algorithm 2: the BFSCut(v, x) function in the case of strongly connected directed graphs.

    Create queue Q;
    Q.enqueue(v);
    Mark v as visited;
    d ← 0; f ← 0; γ̃ ← 0; nd ← 0;
    while Q is not empty do
        u ← Q.dequeue();
        if d(v, u) > d then
            f̃ ← f − γ̃ + (d + 2)(n − nd);
            c̃ ← (n − 1) / f̃;
            if c̃ ≤ x then
                return 0;
            d ← d + 1;
        f ← f + d(v, u);
        γ̃ ← γ̃ + outdeg(u);
        nd ← nd + 1;
        for w in adjacency list of u do
            if w is not visited then
                Q.enqueue(w);
                Mark w as visited;
    return (n − 1) / f;

In order to perform a fair comparison we have considered the improvement factor $\frac{m_{vis}}{m_{tot}}$, where $m_{vis}$ is the number of arcs visited during the algorithm and $m_{tot}$ is the number of arcs visited by the textbook algorithm, which is based on all-pairs shortest paths. It is worth observing that $m_{tot}$ is $m \cdot n$ if the graph is (strongly) connected, but it might be smaller in general. The improvement factor does not consider the preprocessing time, which is negligible, and it is closely related to the running time, since all three algorithms are based on performing BFSes. Furthermore, it does not depend on the implementation or on the machine used to run the algorithm. In the particular case of Olh, we have just counted the arcs visited in BFS and A-BFS, without considering all the operations done in the pruning phases (see [17]).

Our results are summarized in Table 2, where we report the 
arithmetic mean and the geometric mean of all the improve¬ 
ment factors (among all the graphs). In our opinion, the 
geometric mean is more significant, because the arithmetic 
mean is highly influenced by the maximum value. 
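The difference between the two means can be seen on the data itself. The short sketch below takes the improvement factors of Bcm for k = 1 on the eight directed networks of Table 3 and computes both means; the helper names are assumptions of this example. The results are roughly 4.5 and 1.3, in line with the corresponding entries of Table 2: the arithmetic mean is dominated by the single value 17.315, while the geometric mean is not.

```python
import math

# Improvement factors (%) of Bcm, k = 1, directed networks (Table 3).
bcm_k1_directed = [3.052, 4.592, 0.068, 7.458, 0.808, 17.315, 2.575, 0.036]

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # exp of the average logarithm; all values must be positive
    return math.exp(sum(math.log(x) for x in values) / len(values))

print(round(arithmetic_mean(bcm_k1_directed), 2))
print(round(geometric_mean(bcm_k1_directed), 2))
```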

Table 2: comparison of the improvement factors of the new algorithm (Bcm), of the algorithm in [17] (Olh), and of the algorithm in [16] (Ocl). Values are computed for k = 1 and k = 10, and they are averaged over all graphs in the dataset.


3. EXPERIMENTAL RESULTS 

In this section, we will test our algorithm on several real-world networks, in order to show its performance. All the networks used in our experiments were collected from the datasets SNAP (snap.stanford.edu/), NEXUS (nexus.igraph.org), LASAGNE (piluc.dsi.unifi.it/lasagne), LAW (law.di.unimi.it), and IMDB (www.imdb.com). Our tests have been performed on a server running an Intel(R)




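| k | Alg | Arithmetic Mean (Dir) | Arithmetic Mean (Undir) | Arithmetic Mean (Both) | Geometric Mean (Dir) | Geometric Mean (Undir) | Geometric Mean (Both) |
|---|---|---|---|---|---|---|---|
| 1 | Bcm | 4.5% | 2.8% | 3.46% | 1.3% | 0.8% | 0.9% |
| 1 | Olh | 43.5% | 24.2% | 31.6% | 35.6% | 15.8% | 21.5% |
| 1 | Ocl | 72.3% | 45.4% | 55.6% | 67.3% | 42.8% | 50.8% |
| 10 | Bcm | 14.1% | 5.3% | 8.6% | 6.3% | 2.7% | 3.8% |
| 10 | Olh | 43.6% | 24.2% | 31.6% | 35.6% | 15.8% | 21.6% |
| 10 | Ocl | 80.7% | 59.3% | 67.5% | 78.4% | 57.9% | 65.0% |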


² Note that the source code of our competitors is not available.




Table 3: detailed comparison of the improvement factor of the three algorithms with respect to the all-pairs shortest-path algorithm, with k = 1 and k = 10.

Directed Networks

| Network | Nodes | Edges | Bcm (k = 1) | Olh (k = 1) | Ocl (k = 1) | Bcm (k = 10) | Olh (k = 10) | Ocl (k = 10) |
|---|---|---|---|---|---|---|---|---|
| polblogs | 1224 | 19022 | 3.052% | 41.131% | 88.323% | 8.491% | 41.321% | 91.992% |
| p2p-Gnutella08 | 6301 | 20777 | 4.592% | 53.535% | 87.229% | 23.646% | 53.626% | 92.350% |
| wiki-Vote | 7115 | 103689 | 0.068% | 25.205% | 40.069% | 0.825% | 25.226% | 62.262% |
| p2p-Gnutella09 | 8114 | 26013 | 7.458% | 55.754% | 86.867% | 18.649% | 55.940% | 90.248% |
| p2p-Gnutella06 | 8717 | 31525 | 0.808% | 52.615% | 77.768% | 18.432% | 52.831% | 88.884% |
| freeassoc | 10617 | 72172 | 17.315% | 58.204% | 85.831% | 20.640% | 57.954% | 87.300% |
| p2p-Gnutella04 | 10876 | 39994 | 2.575% | 56.788% | 84.128% | 21.754% | 56.813% | 89.961% |
| as-caida20071105 | 26475 | 106762 | 0.036% | 4.740% | 27.985% | 0.100% | 4.740% | 42.955% |

Undirected Networks

| Network | Nodes | Edges | Bcm (k = 1) | Olh (k = 1) | Ocl (k = 1) | Bcm (k = 10) | Olh (k = 10) | Ocl (k = 10) |
|---|---|---|---|---|---|---|---|---|
| Homo | 1027 | 1166 | 5.259% | 82.794% | 82.956% | 14.121% | 82.794% | 88.076% |
| HC-BIOGRID | 4039 | 10321 | 5.914% | 19.112% | 65.672% | 8.928% | 19.112% | 72.070% |
| Mus_musculus | 4610 | 5747 | 1.352% | 7.535% | 55.004% | 5.135% | 7.535% | 66.507% |
| Caenorhabditis_elegans | 4723 | 9842 | 1.161% | 9.489% | 45.623% | 1.749% | 9.489% | 58.521% |
| ca-GrQc | 5242 | 14484 | 3.472% | 13.815% | 55.099% | 5.115% | 13.815% | 62.523% |
| advogato | 7418 | 42892 | 0.427% | 82.757% | 41.364% | 0.891% | 82.757% | 61.688% |
| hprd_pp | 9465 | 37039 | 0.219% | 15.827% | 44.084% | 2.079% | 15.827% | 54.300% |
| ca-HepTh | 9877 | 25973 | 2.796% | 15.474% | 46.257% | 3.630% | 15.474% | 52.196% |
| Drosophila_melanogaster | 10625 | 40781 | 1.454% | 18.347% | 40.513% | 1.991% | 18.347% | 46.847% |
| oregon1_010526 | 11174 | 23409 | 0.058% | 4.937% | 28.221% | 0.233% | 4.937% | 49.966% |
| oregon2_010526 | 11461 | 32730 | 0.090% | 5.848% | 23.780% | 0.269% | 5.848% | 40.102% |
| GoogleNw | 15763 | 148585 | 0.007% | 7.377% | 33.501% | 4.438% | 7.377% | 75.516% |
| dip20090126_MAX | 19928 | 41202 | 14.610% | 31.627% | 27.727% | 20.097% | 31.673% | 42.901% |


More detailed results are available in Table 3.

In the case k = 1 (respectively, k = 10), the geometric mean of the improvement factor of Bcm is 23 (resp. 6) times smaller than that of Olh and 54 (resp. 17) times smaller than that of Ocl. Moreover, we highlight that the new algorithm outperforms all the competitors on every single graph, both with k = 1 and with k = 10.

We have also tested our algorithm on the three unweighted graphs analyzed in [17], respectively called Web, Wiki, and DBLP. By using a single-thread implementation of Bcm, in the Web graph (resp. DBLP) we computed the top-10 nodes in 10 minutes (resp. 10 minutes) on the whole graph, having 875 713 nodes (resp. 1 305 444), while Olh needed about 25 minutes (resp. 4 hours) for a subgraph of 400 000 nodes. The most striking result deals with Wiki, where Bcm needed 30 seconds for the whole graph having 2 394 385 nodes, while Olh needed about 15 minutes on a subgraph with 1 million nodes. Using multiple threads our performance is even better, as we will show in Section 3.3.

3.2 Real-World Large Networks 

In this section, we will run our algorithm on a bigger dataset, composed of 25 directed and 15 undirected networks, with up to 7 414 768 nodes and 191 606 827 edges. Once again, we will consider the number of arcs visited by Bcm, i.e. $m_{vis}$, but this time we will analyze the performance ratio $\frac{m_{vis}}{mn}$ instead of the improvement factor. Indeed, due to the large size of these networks, the textbook algorithm did not finish in a reasonable time.

It is worth observing that we have been able to compute 
for the first time the k most central nodes of networks with 
millions of nodes and hundreds of millions of arcs, with k = 1 
and k = 10. The detailed results are shown in Table 5, where 
for each network we have reported the performance ratio, 


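Table 4: the arithmetic and geometric mean of the performance ratios (percentage).

| k | Arithmetic Mean (Dir) | Arithmetic Mean (Undir) | Arithmetic Mean (Both) | Geometric Mean (Dir) | Geometric Mean (Undir) | Geometric Mean (Both) |
|---|---|---|---|---|---|---|
| 1 | 2.89% | 1.14% | 2.24% | 0.36% | 0.12% | 0.24% |
| 10 | 3.82% | 1.83% | 3.07% | 0.89% | 0.84% | 0.87% |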

both for k = 1 and k = 10. A summary of these results is provided by Table 4, which reports the arithmetic and geometric means.

First of all, we note that the values obtained are impressive: the geometric mean is always below 1%, and for k = 1 it is even smaller. The arithmetic mean is slightly bigger, mainly because of the Amazon product co-purchasing networks, two web networks and one collaboration network, where the performance ratio is quite high. Most of the other networks have a very low performance ratio: with k = 1, 65% of the networks are below 1%, and 32.5% of the networks are below 0.1%. With k = 10, 52.5% of the networks are below 1% and 12.5% are below 0.1%.

We also point out that in some cases the performance ratio is even smaller: a striking example is com-Orkut, where our algorithm for k = 1 is more than 40 000 times faster than the textbook algorithm, whose performance is $m \cdot n$, because the graph is connected.

3.3 Multi-Thread Experiments 

In this section, we will test the performance of a parallel version of Bcm (see Section 2.1). In particular, we have considered the ratio between the time needed to compute the most central node with one thread and with x threads, where $x \in \{1, 2, 4, 8, 16\}$. This ratio is plotted in Figure 1 for k = 1 (for k = 10 very similar results hold). Ideally, the ratio should be very close to the number of threads; however, due




Table 5: performance ratio of the new algorithm. 


Directed Networks 


Network 

Nodes 

Edges 

Perform a: 
k = 1 

nce Ratio (%) 
k = 10 

cit-HepTh 

27770 

352768 

2.31996 

4.65612 

cit-HepPh 

34546 

421534 

1.21083 

1.64227 

p2p-Gnutella31 

62586 

147892 

0.73753 

2.51074 

soc-Epinionsl 

75879 

508837 

0.16844 

1.92346 

soc-sign-Slashdot081106 

77350 

516575 

0.59535 

0.65012 

soc-Slashdot0811 

77360 

828161 

0.01627 

0.48842 

twitter-combined 

81306 

1768135 

0.76594 

1.03332 

soc-sign-Slashdot090216 

81867 

545671 

0.5375 

0.58774 

soc-sign-Slashdot090221 

82140 

549202 

0.54833 

0.5995 

soc-Slashdot0902 

82168 

870161 

0.01662 

0.76048 

gplus-combined 

107614 

13673453 

0.36511 

0.3896 

amazon0302 

262111 

1234877 

10.16028 

11.90729 

email-EuAll 

265214 

418956 

0.00192 

0.00764 

web-Stanford 

281903 

2312497 

6.55454 

10.67736 

web-NotreDame 

325729 

1469679 

0.04592 

0.62945 

amazon0312 

400727 

3200440 

8.17931 

9.40879 

amazon0601 

403394 

3387388 

7.80459 

9.33853 

amazon0505 

410236 

3356824 

7.81823 

9.11571 

web-BerkStan 

685230 

7600595 

21.0286 

23.70427 

web-Google 

875713 

5105039 

0.13662 

0.23239 

in-2004 

1382870 

16539643 

1.81649 

2.51671 

soc-pokec-relationships 

1632803 

30622564 

0.00411 

0.02257 

wiki-Talk 

2394385 

5021410 

0.00029 

0.00247 

indochina-2004 

7414768 

191606827 

1.341 

2.662 


Undirected Networks 


Network 

Nodes 

Edges 

Perform a: 
k = 1 

nce Ratio (%) 
k = 10 

ca-HepPh 

12008 

118489 

9.41901 

9.57862 

ca-AstroPh 

18772 

198050 

1.35832 

3.2326 

ca-CondMat 

23133 

93439 

0.23165 

1.07725 

dblp-conf2015-net-bigcomp 

31951 

95084 

2.46188 

3.42476 

email-Enron 

36692 

183831 

0.10452 

0.28912 

loc-gowalla-edges 

196591 

950327 

0.00342 

3.04066 

com-dblp. ungraph 

317080 

1049866 

0.20647 

0.312 

com-amazon. ungraph 

334863 

925872 

2.55046 

3.08037 

com-youtube. ungraph 

1134890 

2987624 

0.04487 

0.60811 

dblp22015-net-bigcomp 

1305444 

6108712 

0.01618 

0.0542 

as-skitter 

1696415 

11095298 

0.54523 

0.61078 

com-orkut. ungraph 

3072441 

117185083 

0.00241 

0.38956 


to memory access, the actual ratio is smaller. For graphs 
where the performance ratio is small, like in the case of 
wiki-Talk (see Table 5 in the appendix), the running time 
is mostly consumed by the preprocessing phase (even if it is 
O(nlogn)); in these cases, we observe that there seems to 
be no room for parallelization improvements. On the other 
hand, when the computation is more time consuming, like 
in the case of web-google or as-skitter, the parallelization 
is very efficient and close to the optimum. 

We have also tested whether the positive race condition in 
the update of x_k affects performance (see Section 2.1), by 
considering the performance ratio with a varying number of 
threads. In Table 6, we report how much the performance 
ratio increases if the algorithm is run with 16 threads 
instead of 1 (more formally, if p_i is the performance ratio 
with i threads, we report $(p_{16} - p_1)/p_1$). For instance, 
a value of 100% means that the number of visited arcs has 
doubled: ideally, this should result in a factor 8 speedup, 
instead of a factor 16 (that is, the number of threads). We 
observe that these values are very small, especially for 
k = 10, where all values except one are below 5%. This means 
that, ideally, instead of a factor 16 improvement, we obtain 
at least a factor 15.24. The only case where this does not 
hold is wiki-Talk, where in any case the performance ratio is 
very small (see Table 5). 
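
The arithmetic behind the factors 8 and 15.24 can be made 
explicit with a small sketch. It assumes, as the text does, 
an idealized setting in which running time is proportional 
to the number of visited arcs divided by the number of 
threads; the function name `ideal_speedup` is ours. 

```python
def ideal_speedup(threads, increase_pct):
    """Best-case speedup when extra threads inflate the visited arcs.

    increase_pct is the quantity reported in Table 6,
    (p_threads - p_1) / p_1, expressed as a percentage.
    """
    work_factor = 1.0 + increase_pct / 100.0  # total work relative to 1 thread
    return threads / work_factor


print(ideal_speedup(16, 100.0))          # visited arcs doubled -> 8.0
print(round(ideal_speedup(16, 5.0), 2))  # 5% more visited arcs -> 15.24
```

Since every k = 10 value except wiki-Talk is below 5%, the 
ideal speedup with 16 threads is above 15.24 for all those 
networks. 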


Table 6: the increase in performance ratio from 1 to 16 
threads. 

| Network | k = 1 | k = 10 |
|---|---|---|
| MathSciNet | 3.88% | 0.18% |
| com-dblp | 1.32% | 0.74% |
| ydata-vl-0 | 1.40% | 0.24% |
| wiki-Talk | 192.84% | 49.40% |
| web-Google | 1.70% | 0.95% |
| com-youtube | 1.91% | 0.10% |
| dblp22015 | 2.54% | 0.47% |
| as-skitter | 0.14% | 0.18% |
| in-2004 | 0.09% | 0.10% |
| soc-pokec-relationships | 17.71% | 4.51% |
| imdb | 5.61% | 4.03% |


4. IMDB CASE STUDY 

In this section, we will apply the new algorithm Bcm 
to analyze the IMDB graph, where nodes are actors, 
and two actors are connected if they played together in 
a movie (TV series are ignored). The data 
come from the website http://www.imdb.com: in line with 
http://oracleofbacon.org, we decided to exclude some 
genres from our database: awards shows, documentaries, 
game shows, news, reality shows and talk shows. We analyzed 
snapshots of the actor graph, taken every 5 years from 1940 
to 2010, and in 2014. The total time needed to perform the 
computation was 37 minutes with 30 threads, and the results 
are reported in Table 7. 

Figure 1: the benefits of parallelization. 


The Algorithm. The results show that the performance 
ratio decreased drastically while the graph size increased: 
this suggests that the performance of our algorithm improves 
with the input size. This is even more visible from Figure 2, 
where we have plotted the inverse of the performance ratio 
with respect to the number of nodes. It is clear that the plot 
is very close to a line, especially if we exclude the last two 
values. This means that the performance ratio is close to 
$\frac{1}{cn}$, where c is the slope of the line in the plot, 
and the total running time is well approximated by 
$\frac{1}{cn} O(mn) = O(m)$. 
This means that, in practice, for the IMDB actor graph, our 
algorithm is linear in the input size. 
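
As a rough check of this linear trend, the following sketch 
fits a least-squares line to the inverse of the performance 
ratio, using the node counts and ratios listed in Table 7 and 
excluding the last two snapshots as suggested above. The use 
of ordinary least squares with an intercept is our 
assumption (the text only states that the plot is close to a 
line), and `fit_line` is a hypothetical helper. 

```python
# (number of nodes, performance ratio in %) per snapshot, copied from Table 7.
snapshots = [
    (69011, 5.62), (83068, 4.43), (97824, 4.03), (120430, 2.91),
    (146253, 2.21), (174826, 1.60), (210527, 1.14), (257896, 0.83),
    (310278, 0.62), (375322, 0.45), (463078, 0.34), (557373, 0.26),
    (681358, 0.22), (880032, 0.18), (1237879, 0.19), (1797446, 0.14),
]


def fit_line(points):
    """Ordinary least squares for y = c * x + b; returns (c, b, r)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    c = sxy / sxx                   # slope
    b = mean_y - c * mean_x         # intercept
    r = sxy / (sxx * syy) ** 0.5    # Pearson correlation coefficient
    return c, b, r


# Inverse of the ratio as a fraction (100 / percentage),
# dropping the 2010 and 2014 snapshots.
inverse = [(nodes, 100.0 / pct) for nodes, pct in snapshots[:-2]]
c, b, r = fit_line(inverse)
print(f"slope c = {c:.3e}, intercept b = {b:.1f}, correlation r = {r:.4f}")
```

A correlation coefficient close to 1 supports the claim that 
the inverse performance ratio grows roughly linearly with 
the number of nodes on these snapshots. 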


[Figure 2 plot; x-axis: millions of nodes.] 

Figure 2: growth of performance ratio with respect to the 
number of nodes. 


The Results. In 2014, the most central actor is Michael 
Madsen, whose career spans 25 years and more than 170 
films. Among his most famous appearances, he played 
Jimmy Lennox in Thelma & Louise (Ridley Scott, 1991), 
Glen Greenwood in Free Willy (Simon Wincer, 1993), 
Bob in Sin City (Frank Miller, Robert Rodriguez, Quentin 
Tarantino), and Deadly Viper Budd in Kill Bill (Quentin 
Tarantino, 2003-2004). It is worth noting that he played in 
movies of very different kinds, and consequently he could 
“reach” many actors in a small number of steps. The second 
is Danny Trejo, whose most famous movies are Heat 
(Michael Mann, 1995), where he played Trejo, Machete 
(Ethan Maniquis, Robert Rodriguez, 2010) and Machete 
Kills (Robert Rodriguez, 2013), where he played Machete. 
The third “actor” is not really an actor: he is the German 
dictator Adolf Hitler, who was also the most central actor in 
2005 and 2010, and has been in the top 10 since 1990. This is 
a consequence of his appearances in archive footage 
that was re-used in several movies (he counts 775 credits, 
even if most of them are in documentaries or TV shows, 
which were eliminated). Among the movies he is credited 
in, we find Zelig (Woody Allen, 1983), and The Imitation 
Game (Morten Tyldum, 2014): obviously, in both movies, 
he played himself. 

Among the other most central actors, we find many people 
who appeared in a lot of movies, and most of them are quite 
important actors. However, this ranking does not discriminate 
between important roles and marginal roles: for instance, 
the actress Bess Flowers is not widely known, because she 
rarely played significant roles, but she appeared in over 700 
movies in her 41-year career, and for this reason she was 
the most central actor for 30 years, between 1950 and 1980. 
Finally, it is worth noting that we never find Kevin Bacon in 
the top 10, even if he became famous for the “Six Degrees 
of Kevin Bacon” game (http://oracleofbacon.org), where 
the player receives an actor x and has to find a path of 
length at most 6 from x to Kevin Bacon in the actor graph. 
Kevin Bacon was chosen as the goal because he played in 
several movies, and he was thought to be one of the most 
central actors: this work shows that, actually, he is quite 
far from being in the top 10. Indeed, his closeness 
centrality is 0.336, while the most central actor, Michael 
Madsen, has centrality 0.354, and the 10th actor, Christopher 
Lee, has centrality 0.350. We have run the algorithm again 
with k = 100, and we have seen that the 100th actor is Rip 
Torn, with closeness centrality 0.341. 

Table 7: detailed results of the IMDB actor graph. 

| Year | 1940 | 1945 | 1950 | 1955 |
|---|---|---|---|---|
| Nodes | 69 011 | 83 068 | 97 824 | 120 430 |
| Edges | 3 417 144 | 5 160 584 | 6 793 184 | 8 674 159 |
| Perf. ratio | 5.62% | 4.43% | 4.03% | 2.91% |
| 1st | Semels, Harry (I) | Corrado, Gino | Flowers, Bess | Flowers, Bess |
| 2nd | Corrado, Gino | Steers, Larry | Steers, Larry | Harris, Sam (II) |
| 3rd | Steers, Larry | Flowers, Bess | Corrado, Gino | Steers, Larry |
| 4th | Bracey, Sidney | Semels, Harry (I) | Harris, Sam (II) | Corrado, Gino |
| 5th | Lucas, Wilfred | White, Leo (I) | Semels, Harry (I) | Miller, Harold (I) |
| 6th | White, Leo (I) | Mortimer, Edmund | Davis, George (I) | Farnum, Franklyn |
| 7th | Martell, Alphonse | Boteler, Wade | Magrill, George | Magrill, George |
| 8th | Conti, Albert (I) | Phelps, Lee (I) | Phelps, Lee (I) | Conaty, James |
| 9th | Flowers, Bess | Ring, Cyril | Ring, Cyril | Davis, George (I) |
| 10th | Sedan, Rolfe | Bracey, Sidney | Moorhouse, Bert | Cording, Harry |

| Year | 1960 | 1965 | 1970 | 1975 |
|---|---|---|---|---|
| Nodes | 146 253 | 174 826 | 210 527 | 257 896 |
| Edges | 11 197 509 | 12 649 114 | 14 209 908 | 16 080 065 |
| Perf. ratio | 2.21% | 1.60% | 1.14% | 0.83% |
| 1st | Flowers, Bess | Flowers, Bess | Flowers, Bess | Flowers, Bess |
| 2nd | Harris, Sam (II) | Harris, Sam (II) | Harris, Sam (II) | Harris, Sam (II) |
| 3rd | Farnum, Franklyn | Farnum, Franklyn | Tamiroff, Akim | Tamiroff, Akim |
| 4th | Miller, Harold (I) | Miller, Harold (I) | Farnum, Franklyn | Welles, Orson |
| 5th | Chefe, Jack | Holmes, Stuart | Miller, Harold (I) | Sayre, Jeffrey |
| 6th | Holmes, Stuart | Sayre, Jeffrey | Sayre, Jeffrey | Miller, Harold (I) |
| 7th | Steers, Larry | Chefe, Jack | Quinn, Anthony (I) | Farnum, Franklyn |
| 8th | Paris, Manuel | Paris, Manuel | O’Brien, William H. | Kemp, Kenner G. |
| 9th | O’Brien, William H. | O’Brien, William H. | Holmes, Stuart | Quinn, Anthony (I) |
| 10th | Sayre, Jeffrey | Stevens, Bert (I) | Stevens, Bert (I) | O’Brien, William H. |

| Year | 1980 | 1985 | 1990 | 1995 |
|---|---|---|---|---|
| Nodes | 310 278 | 375 322 | 463 078 | 557 373 |
| Edges | 18 252 462 | 20 970 510 | 24 573 288 | 28 542 684 |
| Perf. ratio | 0.62% | 0.45% | 0.34% | 0.26% |
| 1st | Flowers, Bess | Welles, Orson | Welles, Orson | Lee, Christopher (I) |
| 2nd | Harris, Sam (II) | Flowers, Bess | Carradine, John | Welles, Orson |
| 3rd | Welles, Orson | Harris, Sam (II) | Flowers, Bess | Quinn, Anthony (I) |
| 4th | Sayre, Jeffrey | Quinn, Anthony (I) | Lee, Christopher (I) | Pleasence, Donald |
| 5th | Quinn, Anthony (I) | Sayre, Jeffrey | Harris, Sam (II) | Hitler, Adolf |
| 6th | Tamiroff, Akim | Carradine, John | Quinn, Anthony (I) | Carradine, John |
| 7th | Miller, Harold (I) | Kemp, Kenner G. | Pleasence, Donald | Flowers, Bess |
| 8th | Kemp, Kenner G. | Miller, Harold (I) | Sayre, Jeffrey | Mitchum, Robert |
| 9th | Farnum, Franklyn | Niven, David (I) | Tovey, Arthur | Harris, Sam (II) |
| 10th | Niven, David (I) | Tamiroff, Akim | Hitler, Adolf | Sayre, Jeffrey |

| Year | 2000 | 2005 | 2010 | 2014 |
|---|---|---|---|---|
| Nodes | 681 358 | 880 032 | 1 237 879 | 1 797 446 |
| Edges | 33 564 142 | 41 079 259 | 53 625 608 | 72 880 156 |
| Perf. ratio | 0.22% | 0.18% | 0.19% | 0.14% |
| 1st | Lee, Christopher (I) | Hitler, Adolf | Hitler, Adolf | Madsen, Michael (I) |
| 2nd | Hitler, Adolf | Lee, Christopher (I) | Lee, Christopher (I) | Trejo, Danny |
| 3rd | Pleasence, Donald | Steiger, Rod | Hopper, Dennis | Hitler, Adolf |
| 4th | Welles, Orson | Sutherland, Donald (I) | Keitel, Harvey (I) | Roberts, Eric (I) |
| 5th | Quinn, Anthony (I) | Pleasence, Donald | Carradine, David | De Niro, Robert |
| 6th | Steiger, Rod | Hopper, Dennis | Sutherland, Donald (I) | Dafoe, Willem |
| 7th | Carradine, John | Keitel, Harvey (I) | Dafoe, Willem | Jackson, Samuel L. |
| 8th | Sutherland, Donald (I) | von Sydow, Max (I) | Caine, Michael (I) | Keitel, Harvey (I) |
| 9th | Mitchum, Robert | Caine, Michael (I) | Sheen, Martin | Carradine, David |
| 10th | Connery, Sean | Sheen, Martin | Kier, Udo | Lee, Christopher (I) |















































